Checkpoint/Restore test fixes#11955
Conversation
|
@edsantiago PTAL |
|
My only concern is that perhaps the 3-second timeout (w/ 5 retries) is too long. However, the code also isn't waiting for a 'ready' signal from the container (can it?), and the |
ee739d7 to
40ef16c
Compare
|
Switched to |
|
LGTM |
40ef16c to
b33845f
Compare
|
/lgtm |
|
/hold |
|
/hold cancel |
jwhonce
left a comment
There was a problem hiding this comment.
Issues I see are mostly modifying a known good port. Those new ports could be used elsewhere which cause CI flakes. Generate all "new" port numbers using GetRandomPort()
test/e2e/checkpoint_test.go
Outdated
| fmt.Fprintf(os.Stderr, "Trying to connect to redis server at localhost:%d", orginalPort) | ||
| // Open a network connection to the redis server via initial port mapping | ||
| conn, err := net.DialTimeout("tcp4", fmt.Sprintf("localhost:%d", orginalPort), time.Duration(3)*time.Second) | ||
| Expect(err).To(BeNil()) |
There was a problem hiding this comment.
| Expect(err).To(BeNil()) | |
| Expect(err).ShouldNot(HaveOccurred()) |
test/e2e/checkpoint_test.go
Outdated
| } | ||
| fmt.Fprintf(os.Stderr, "Trying to reconnect to redis server at localhost:%d", orginalPort+100) | ||
| conn, err = net.DialTimeout("tcp4", fmt.Sprintf("localhost:%d", orginalPort+100), time.Duration(3)*time.Second) | ||
| Expect(err).To(BeNil()) |
There was a problem hiding this comment.
| Expect(err).To(BeNil()) | |
| Expect(err).ShouldNot(HaveOccurred()) |
test/e2e/checkpoint_test.go
Outdated
| localRunString := getRunString([]string{"-p", "1234:6379", "--rm", redis}) | ||
| port, err := utils.GetRandomPort() | ||
| Expect(err).ShouldNot(HaveOccurred()) | ||
| orginalPort := port + GinkgoParallelProcess() |
There was a problem hiding this comment.
I'm confused, you get a random port (that is "safe") and then add something to it?
test/e2e/checkpoint_test.go
Outdated
| Fail("Container failed to get ready") | ||
| } | ||
|
|
||
| fmt.Fprintf(os.Stderr, "Trying to connect to redis server at localhost:%d", orginalPort) |
There was a problem hiding this comment.
I think so, the test was hanging just after this point, so knowing what it was trying to do may also be helpful in the future. Ginkgo doesn't give us this detail otherwise 😕
There was a problem hiding this comment.
Exactly. It was not added for debugging just for better feedback during the test to know what is happening.
test/e2e/checkpoint_test.go
Outdated
|
|
||
| // Restore container with different port mapping | ||
| result = podmanTest.Podman([]string{"container", "restore", "-p", "1235:6379", "-i", fileName}) | ||
| result = podmanTest.Podman([]string{"container", "restore", "-p", fmt.Sprintf("%d:6379", orginalPort+100), "-i", fileName}) |
There was a problem hiding this comment.
Again, adding a number to a "safe" port?
test/e2e/checkpoint_test.go
Outdated
| // Open a network connection to the redis server via initial port mapping | ||
| // This should fail | ||
| conn, err = net.Dial("tcp", "localhost:1234") | ||
| conn, err = net.DialTimeout("tcp4", fmt.Sprintf("localhost:%d", orginalPort), time.Duration(3)*time.Second) |
Oooohhhh, I understand. So the concern is if this test goes south/hangs/leaks, another test might also ask for a port and receive the same one (since it was |
Moving to Fedora 35 showed test failures (time outs) in the test
"podman checkpoint and restore container with different port mappings"
The test starts a container and maps the internal port 6379 to the local
port 1234 ('-p 1234:6379') and then tries to connect to localhost:1234
On Fedora 35 this failed and blocked the test because the container was
not yet ready. The test was trying to connect to localhost:1234 but
nothing was running there. So the error was not checkpointing related.
Before trying to connect to the container the test is now waiting for
the container to be ready.
Another problem with this test and running ginkgo in parallel was that
it was possible that the port was already in use. Now for each run a
random port is selected to decrease the chance of collisions.
Signed-off-by: Adrian Reber <areber@redhat.com>
b33845f to
8439a6d
Compare
|
@jwhonce Thanks for the review. I adapted the PR to fix what you commented on. Besides the debug output which was added to know that the test is trying to connect to the container. I can remove it but I think it can be helpful understanding what is going on. |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: adrianreber, cevich, jwhonce The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Thanks everyone. |
Moving to Fedora 35 showed test failures (time outs) in the test
"podman checkpoint and restore container with different port mappings"
The test starts a container and maps the internal port 6379 to the local
port 1234 ('-p 1234:6379') and then tries to connect to localhost:1234
On Fedora 35 this failed and blocked the test because the container was
not yet ready. The test was trying to connect to localhost:1234 but
nothing was running there. So the error was not checkpointing related.
Another problem with this test and running ginkgo in parallel was that
it was possible that the port was already in use. Now for each run a
random port is selected to decrease the chance of collisions.
This tries to fix problems seen in #11795
CC: @cevich